Estimation in regret-regression using quadratic inference functions with ridge estimator

In this paper, we propose a new estimation method in estimating optimal dynamic treatment regimes. The quadratic inference functions in myopic regret-regression (QIF-MRr) can be used to estimate the parameters of the mean response at each visit, conditional on previous states and actions. Singularity issues may arise during computation when estimating the parameters in ODTR using QIF-MRr due to multicollinearity. Hence, the ridge penalty was introduced in rQIF-MRr to tackle the issues. A simulation study and an application to anticoagulation dataset were conducted to investigate the model’s performance in parameter estimation. The results show that estimations using rQIF-MRr are more efficient than the QIF-MRr.


Introduction
A dynamic treatment regime (DTR) is a branch of personalized medicine that uses information from the patient to minimize health problems. In reality, the treatment response for each patient is different, which influenced to the development of the DTR and personalized medicine. The advantages of personalized medicine include the following: a cutback in the total cost of health care; the patient receiving an option to intensive health care by deciding an optimal decision for the treatment; and increased compliance and devotion towards treatment [1].
Throughout the years, researchers have shown an interest in DTR. For example, [2] described anticoagulant data to obtain an optimal strategy for the warfarin-treated patient who is at risk of thrombosis (i.e., abnormal blood clotting). Another example of DTR is the estimation of the optimal time by [3] for an asymptomatic HIV-infected subject to start highly active retroviral therapy (HAART). Others include [4][5][6][7].
DTR, also known as adaptive strategies, adaptive interventions, or treatment policies, is a multi-stage decision rule of personalized medicine. It defines a set of decision rules to determine the treatment of an individual based on their health condition and treatment history. The term "dynamic" indicates a variation of treatment using the patient's current state and previous treatment decisions. The purpose of DTR is to optimize the mean outcome for each patient, also called optimal dynamic treatment regime (ODTR).

PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0271542 July 21, 2022 1 / 15 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 [8] proposed a regret function to estimate ODTR semi-parametrically and parameterized the optimal rules using the iterative minimization of regrets (IMOR) method. Following the development of ODTR, [9] proposed a regret-regression method where the ODTR was estimated by incorporating the regret function into regression modeling. A doubly robust version of the regret regression by [10] claimed to be equivalent to a reduced form of the efficient gestimation method [11,12]. [13] introduced the myopic regret-regression (MRr) which is a short-term strategy of the regret-regression. In a short-term strategy, the ODTR was estimated at each time point. The MRr provides good estimates when the outcome is measured through time. However, in the previous works mentioned above, no attention has been given to the correlation within-subject. ODTR is actually a longitudinal dataset where the observations of one subject are dependent on each other over time. Many studies made use of longitudinal data, for example in [14][15][16].
To estimate ODTR for correlated data, [13] proposed a method called QIF-MRr which integrates the MRr into the quadratic inference functions (QIF). The QIF method was proposed by [17] where it is an extension to the generalized estimating equations (GEE) [18] which has widely been used in the analysis of longitudinal and correlated data [19][20][21][22]. Hence, the QIF-MRr is able to provide unbiased and efficient estimates even with the misspecified working correlation structure.
In the early years [23], proposed the penalized estimating equations, which considered a bridge penalty, and applied them to the GEE method. Meanwhile, [24] proposed the penalized QIF by using the penalized spline and QIF method. In this paper, we proposed an estimation strategy called the ridge quadratic inference function for myopic regret-regression (rQIF-MRr) to estimate ODTR for correlated data. The ridge penalty, also known as L 2 penalty [25]. The reason we choose the ridge penalty over the other penalties is that the ridge penalty can control inflation and general instability related to the least square estimates [26] and it also can solve singularity issues [27].
The paper is organized as follows: We first define the notations and assumptions needed throughout the paper. Then, we propose the rQIF-MRr in estimating ODTR. The proposed method is illustrated using simulation and application to anticoagulant dataset.

Methods
In this section, we will propose a ridge quadratic inference function for myopic regret-regression (rQIF-MRr) to estimate the parameter for ODTR. We follow the notations obtained from [9]. Suppose, • j = 1, 2, . . ., K is the number of time visits, with K is the final time visit for subject i.
• S j represents the current state of the subject i at visit j.
• A j represents an action or treatment decision made for subject i at visit j.
. . . ; S j Þ is the cumulative information of the states for subject i measured from the first to the jth visit.
. . . ; A j Þ is the cumulative actions given to subject i from the first to the jth visit based on the previous history ð � S jÀ 1 ; � A jÀ 1 Þ.
• The action given at visit j is determined by the state of the subject at visit j with the actions given at the previous visits, A j−1 .
• The states and actions ð � S j ; � A j Þ were assumed to be independent between the subjects and dependent within a subject.
• The response Y j were measured at each visit j.
• From the notion of the potential outcome or counterfactual [28,29], a j is a set of all possible actions that can be given at visit j.
• Yð� a K Þ is the potential outcome under the possible action � a K .
• d opt j is the optimal dynamic treatment regime which optimizes the expected value of outcome Y j .
• EðYðd opt j Þj � S j ; � A jÀ 1 Þ is the expected value of the potential outcome or counterfactual final responses • EðYða j ; d opt jþ1 Þj � S j ; � A jÀ 1 Þ is the expected value of the potential outcome if action a j is chosen at time j and then subsequently the optimal decision rules are followed.
Three assumptions were made in this study, which are consistency, no unmeasured confounders, and positivity. The assumption of consistency is when the observed outcome Y is equal to the potential outcome Yð� a K Þ and the observed state history � S K is equal to the potential state history � S � a KÀ 1 under the observed treatment a K = A K . The treatment is given in a way that it is possible for all the treatment options to be assigned to all patients in the population under consideration.
The assumption of no unmeasured confounders is that the decision for each treatment depends only on the observed states and treatment history. For any regime � a K , the action given at visit j, A j is independent of any future or potential states or outcome given the previous history, for j = 1, 2, . . ., K. If there is no drop-out, the assumption is equivalent to the exchangeability. If the subjects are censored, further assumption is needed where the censoring is non-informative conditional on history. That is, the potential outcome of a censored patient will follow the same distribution as the uncensored patients.
The third assumption about positivity is that the optimal treatment regime has a nonzero or positive probability of being observed in the data. In continuous treatment, the optimal treatment regime is identifiable from the observed data. The assumption may be theoretically and practically violated. A theoretical violation occurs when the study's design prevents a patient from receiving a specific therapy. Practical violations, on the other hand, occur when a portion of the patient has a very low probability of receiving therapy.

Ridge quadratic inference function for myopic regret-regression (rQIF-MRr)
Suppose the mean response of the MRr from [13] is where Þ depend on the history before visit j of the states and actions [9,10]. Meanwhile, the residuals, Þ is a linear combination between S j and the expected value of S j given the his- cÞ is zero if the optimal action is selected at visit j. Otherwise, the regret function is positive since the aim is to maximize the response Y j [9]. [17] defines R(ρ) to be the working correlation matrix with parameter ρ, and the inverse function of R −1 (ρ) can be approximated by a linear combination of several basis matrices defined as where τ 1 are unknown coefficients and M l are known basis matrices.
There are several types of working correlation structures [17], but in this paper, we only focus on three common types of working correlation structures, which are the first order autoregressive, AR(1), exchangeable and unspecified working correlation structures. The estimating equation of the QIF-MRr is and can be written in the form of its extended score as where g i (β, ψ) is the estimating equations from Eq (3), and the linear coefficients, τ l can be viewed as nuisance parameters [24]. The D i is a diagonal matrix of the marginal variances for subject i, and M 1 , . . ., M m are the basis functions for the working correlation matrix with dimension (K × K). The derivatives of the h j (β, ψ) term, @h T /@(β, ψ) and vector g i ((β, ψ)) have (p × K) dimension where p is the number of parameters (β, ψ). As there are more equations than the parameters, the generalized method of moments (GMM) from [30], can be applied to create the QIF-MRr as where, The parameters β and ψ can be estimated by setting the extended score, g n (β, ψ) in Eq (4) as close to zero as possible, which is by minimizing the Q n (β, ψ) function as ðb;ĉÞ ¼ arg min ðb;cÞ Singularity problem often occurs during estimation using QIF-MRr in ODTR. [27] used a ridge-regression to solve singularity problem. Thus, by applying the penalized QIF from [31], we introduce the ridge penalty in QIF-MRr and define a new Q n function of Eq (5) as where λ is the tuning parameter, and P p v¼1 jðb; cÞ v j 2 is a ridge penalty function where v = 1, . . ., p is the number of parameter (β, ψ), and p is the total number of parameter (β, ψ). The penalty function will act as a weight during estimation, and stabilize the estimation of the parameter in the computation. The rQIF-MRr minimizes the parameters β and ψ as ðb;ĉÞ ¼ arg min ðb;cÞ

Results and discussions
The parameter estimates of the proposed method were investigated using simulations. The simulation dataset were generated sing the scenario from [8] in order to estimate the mean response Y j at each visit. Let i = 1, 2, . . ., n be the observed subject, and n is the sample size. Then, j = 1, 2, . . ., K is the time visit where K = 10 is the final visit. We generate the first state S 1 for each i from a normal distribution with mean 0.5 and variance 0.01. For the second state onwards, the states S j � N(m j , 0.01), where m j ¼ 0:5 þ 0:2S jÀ 1 À 0:07A jÀ 1 ; for j = 2, 3, . . ., K. In this simulation, we only consider one action per visit. The action, A j were generated from uniform distribution, A j � U{0, 1, 2, 3}.
The regret function is defined as where min where I j is a vector of length K for each subject. The response, Y j is generated from a normal distribution with mean, and variance s 2 The residuals, 2 � Nð0; s 2 Y SÞ where S is obtained from an autoregressive true correlation matrix. Applying the Cholesky decomposition, S = CC T which decomposed a positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. We take � = C W where W � N(0, I R ). Thus, Then, the response variable for each subject i is The correlation in the response variable was levelled into three levels: ρ = 0.1, 0.5, 0.95. ρ = 0.1 indicates a low correlation, while ρ = 0.5 is a medium correlation, and ρ = 0.95 is a high correlation. The initial parameter values for the coefficients are β = {3, −5} and ψ = {1.5, 0.1, 5.5}.
To estimate the parameters, we first regress each S j on history ð � S jÀ 1 ; � A jÀ 1 Þ and define Z j [9]. Then, we differentiate h j (β, ψ) from Eq (11) with respect to β 0 , β 1 , ψ 1 , ψ 2 and ψ 3 to obtain @h/ @(β, ψ). For each subject i, the partial derivative with respect to β 0 is and the partial derivative with respect to β 1 is where Z is the residual vector for subject i. The partial derivatives with respect to ψ 1 , ψ 2 and ψ 3 are then @h @c 1 ¼ À I; where I is an indicators sign of the regrets for subject i from Eq (10).
The parameters ðb;ĉÞ can be obtained using the optim built-in function in R by minimizing Eq (8). The tuning parameter, λ was obtained using a cross-validation technique [32]. The optimal tuning parameter for this simulation is λ = 0.01. For the simulation, we used bootstrap resampling 1000 times with two sample sizes (n = 25 and n = 500). [33] considered that n = 25 as a small sample size. Hence, we used a sample size n = 25 to compare the performance in estimates between the rQIf-MRr and QIF-MRr methods. We fit the models (i.e. QIF-MRr and rQIF-MRr) using AR(1), exchangeable, and unspecified working correlation structures.
For the AR(1) working correlation structure, the inverse working correlation structure can be written as where M 0 is an identity matrix Here For an exchangeable working correlation structure, R(ρ) consists of 1's on the diagonal and ρ's everywhere off-diagonal. Then, R −1 is given as where M 0 is an identity matrix and M 1 is a matrix with diagonal elements 0 and off-diagonal elements 1 The unspecified working correlation structure can be used to determine the working correlation structure when there is some difficulty and challenges in obtaining it. For the unspecified working correlation structure, the basis matrices M 0 = I n and M 1 ¼Û , wherê and the matrixÛ is a consistent estimator of the variance matrix of Y [19]. The correctly specified working correlation structure is when we use the AR(1) working correlation structure to generate the data and fit the models. Otherwise, the model is considered a misspecified working correlation structure. The results below give the mean value of the parameter estimates for 1000 bootstrap resampling (Mean), standard error (SE), and the root mean square error (RMSE).
With sample sizes of n = 25 and n = 500 at different correlation values, ρ, Table 1 compares parameter estimations using QIF-MRr and rQIF-MRr. The rQIF-MRr is more efficient than the QIF-MRr for small and large sample sizes, with small SE and RMSE.
The results for the correctly specified working correlation structure was shown in Table 1, where the data is generated and fitted using AR(1) working correlation structure. The advantage of using the QIF-MRr and rQIF-MRr in estimation is that the estimates are still efficient even with the misspecified working correlation structure. The estimation for the misspecified working correlation structures for both QIF-MRr and rQIF-MRr are given in Tables 2-4. For low correlation, ρ = 0.1 in Table 2, the parameter estimation for rQIF-MRr are unbiased and efficient with small SE and RMSE even with misspecified working correlation structures. The correctly specified AR(1) for rQIF-MRr gives slightly smaller SE compared to exchangeable and unspecified working correlation structures.
When the correlation is medium (ρ = 0.5), and high (ρ = 0.95) in Tables 3 and 4 respectively, estimation using misspecified working correlation structures still gives unbiased and efficient estimates with small SE and RMSE. Although the true model (i.e. AR(1)) gives slightly better estimates, but the difference is not far. This is one of the advantage of using the QIF in estimation, where the parameter estimates is efficient even with the misspecified working correlation structure [24,34].

Application to Warfarin data
For application, we use the data from Warfarin-treated patients who at risk of thrombosis [2]. The data consists of 303 patients with 14 clinic visits. The states, S j , is defined as the difference between the International Normalized Ratio (INR) at visit j and the INR within the target range. Meanwhile, A j is defined as the dose given to patients at each visit j. Suppose the INR is used to measure blood clotting speed, and a positive S j indicates that the clotting time was too long and that the dose A j should be reduced, and vice versa. The goal of the treatment is to make sure the INR is within the target range. From the 14 clinic visits, only 9 visits were considered, where the first 4 visits were treated as a stabilization period, and the last visits had no contribution to the outcome. At each visit j, the response Y j is measured for j = 1, 2, . . ., K with K = 9. Hence, the mean response for Y j conditional on ð � S j ; � A j Þ as in Eq (1). The mixture model for S j is used to obtain the state residuals, Z j . The model consists of a logistic component for P(S j = 0) and linear component for |S j | given (S j 6 ¼ 0). The regret function is modeled as in Eq (9). The first step will be estimating the residual of the state function, Z j . Then, for each i, we estimate the g i (β, ψ) functions, the extended score matrix g n (β, ψ), and the partial derivatives @h T i =@ðb; cÞ. In estimation, the initial value for (β, ψ) = {8.00, 0.00, 2.00, 0.25, −5.00} given that (β, ψ) is unknown for this application [13].
The parameter estimates of ðb;ĉÞ can be obtained as in Eqs 6 and 8 for QIF-MRr and rQIF-MRr respectively. Using a cross-validation technique, the optimal tuning parameter λ = 0.1375436. 100 bootstrap resamplings were performed to test the consistency and efficiency of the estimation using AR(1), exchangeable, and unspecified working correlation structures. Table 5 shows the results of parameter estimations for Warfarin data using the QIF-MRr and rQIF-MRr with three different types of working correlation structures. In comparison to AR(1) with an unspecified working correlation structure, estimation using rQIF-MRr with an exchangeable working correlation structure is more efficient with a smaller SE. Estimation using the rQIf-MRr with the AR(1) working correlation structure produces better results than the QIF-MRr. Meanwhile, the findings for both methods are almost similar when we estimate the parameters using an exchangeable and unspecified working correlation structure.

Conclusions
The rQIF-MRr method was proposed in this paper with the goal of improving estimation in ODTR. There is huge potential to explore rQIF-MRr in personalized medicine, particularly in estimating ODTR. The study of ODTR is a branch of personalized medicine and it is a promising and developing field.
The simulation studies show that the parameter estimation using rQIF-MRr is unbiased and efficient even with the misspecified working correlation structure. In the simulation analysis, the proposed rQIF-MRr method performed well for small and large sample sizes at any correlation level. In comparison to the QIF-MRr, parameter estimation using the rQIF-MRr gives an efficient estimate with a minimal standard error when applied to Warfarin data.
Comparisons of different methods for determining the optimal tuning parameter λ, such as generalized cross-validation technique [35,36], are in our best interest. Genetic algorithm [37] and particle swarm optimization [38] are two other methods.
When working with participants or patients with a defined time study in ODTR, there may be dropouts during data collection. It's possible that this will result in either missing or survival data. The proposed method is incompatible with this type of data. Improvisation is required to deal with missing and survival data.